Abstract

Feature selection is one of the most prominent learning tasks, especially in high-dimensional
datasets in which the goal is to understand the mechanisms that underlie the learning dataset.
However, most feature selection methods typically deliver just a flat set of relevant features
and provide no further information on what kind of structures, e.g. feature groupings, might
underlie the set of relevant features. In this paper we propose a new learning paradigm in which
our goal is to uncover the structures that underlie the set of relevant features for a given
learning problem. We uncover two types of feature sets: non-replaceable features, which contain
important information about the target variable and cannot be replaced by other features, and
functionally similar feature sets, which can be used interchangeably in learned models, given
the presence of the non-replaceable features, with no change in the predictive performance. To
do so we propose a new learning algorithm that learns a number of disjoint models using a model
disjointness regularization constraint together with a constraint on the predictive agreement of
the disjoint models. We explore the behavior of our approach on a number of high-dimensional
datasets, and show that, as expected by their construction, the learned models satisfy a number
of properties, namely model disjointness, a high predictive agreement, and a predictive
performance similar to that of models learned on the full set of relevant features. The ability
to structure the set of relevant features in such a manner can become a valuable tool in
different applications of scientific knowledge discovery.



1 Introduction 



Feature selection [7] is one of the most often performed tasks in supervised learning problems,
especially when the goal is to gain an understanding of the mechanisms that underlie some, often
high dimensional, learning dataset. Such analysis scenarios are typical in scientific knowledge
discovery, with biology providing ample examples. However, existing feature selection and classification
algorithms provide at most a flat list of relevant features, with no further information on the internal 
structure of that feature set. Nevertheless it is now a well known fact that within a set of relevant 
features for a given problem there can be a number of different models defined over different feature 
subsets which nevertheless are of high predictive power [6, 2]. A typical such scenario appears in
problems with high levels of feature redundancy. 

In this paper we want to go one step further and uncover the structure underlying the set of relevant 
features for a given learning problem. We will do so by learning within this feature set as many as 
possible structurally dissimilar models, i.e. models defined over different feature subsets of the set 
of relevant features. Nevertheless we will constrain these models to have a very similar predictive 
behavior in terms of the predictions they make, and a very high predictive power, similar to that 
which a model learned on the full set of relevant features would achieve. By learning different 
models which have these properties we expect that highly complementary feature sets with respect
to the target variable are placed together in the individual models, while highly redundant feature
sets will end up in different models. We will further structure the features used by these basis
models in two basic feature sets. One feature set will be the non-replaceable features, i.e.
features that will be systematically present within all the basis models learned. This set of
features is critical for the accurate description of the target variable and their removal from a
basis model would result in a loss of predictive power. In addition to that set we will have the
set of the complement feature sets of the non-replaceable feature set defined over the different
basis models. These complement feature sets have a similar information content with respect to the
target variable given the set of non-replaceable features; each one of them can be used instead of
another without any significant change in the predictive behavior or performance. The availability
of such a structure can provide us with a much better insight into the learning problem that is
studied. This can be especially valuable in problems in which understanding the mechanisms that
underlie the learning problem is what drives the data analysis process.

To the best of our knowledge there exist no learning approaches that are able to uncover
structures within a set of relevant features such as the ones just described. Standard feature
selection and classification algorithms, as already mentioned, return only a flat set of relevant
features, often accompanied by their relative importance in terms of some ranking score. A rather
simplistic approach that is often used to structure the set of features relies on the use of
pairwise feature redundancies estimated through some feature similarity measure. Most often these
approaches take the form of feature clustering which uses some measure of feature correlation as
the feature similarity measure, thus placing features with a high degree of pairwise redundancy in
the same cluster. Nevertheless, the target of this rather different feature structure is often to
provide background knowledge for further regularizing the model fitting [11, 15, 10].

A central component in structuring the set of relevant features in the manner described above is 
to come up with a way to learn as many as possible equally good but dissimilar models. In this 
paper we will present a novel multiple model learning algorithm that does exactly that. We will take 
standard objective functions such as the ones used in learning simple linear models and use them 
to simultaneously learn a number of dissimilar models by coupling them with a novel disjointness 
regularization which will force the learned models to use different discriminative features. In order 
to guarantee that all models will be of roughly equal predictive power and equivalent predictive 
behavior we will also regularize them in a manner that will force them to produce very similar 
predictions. We will demonstrate the utility of the novel learning task on a number of
high-dimensional microarray classification problems.

The rest of the paper is organized as follows. In Section 2 we introduce a number of necessary
definitions, which we use in Section 2.2 to describe our approach to learning multiple models
and the respective optimization problem; in Section 2.3 we show how to solve the latter. In
Section 3 we present our experiments, and we conclude in Section 4.

2 Learning the Structure of a Set of Predictive Features 

The new learning task that we want to define and address is to uncover the structure underlying a
set of features for a given learning problem. For simplicity, in this paper we will learn the
structure of a relevant feature set. Nevertheless, our approach can also be used to uncover the
structure of a feature set that contains irrelevant features. In this case, however, a sparse
learning algorithm with an embedded feature selection strategy should be used to remove the
irrelevant features.

The type of structure that we wish to uncover is the separation of the relevant feature set into a
set of non-replaceable features and a set of functionally similar feature sets given the
non-replaceable set of features. We assume that the original set of relevant features will be
either given or we will establish it with the help of some feature selection algorithm. In the
experimental part of this paper we will use an SVM with elastic net regularization (EN-SVM) [12]
and retain only features with a non-zero coefficient to determine the initial set of relevant
features.

We denote by $X$ the $n \times d$ matrix of learning instances for the given relevant feature set
$S = (s_1, \dots, s_d)$. The $i$-th row of $X$ is the instance $x_i^T \in \mathbb{R}^d$; the vector
$y = (y_1, \dots, y_n)^T$, $y_i \in \{-1, 1\}$, is the vector of class labels. For simplicity we
will only consider binary classification; the extension to multiclass classification will be
discussed later. To structure the set $S$ we will rely on models that will be learned with the help
of some linear algorithm. We will use a linear SVM with $\ell_2$ regularization lIU, and we will
denote a model learned on $S$ by $w_S = (w_1, \dots, w_d)$, by $e(w_S)$ its predictive error, and by
$w_S(x)$ its prediction for the instance $x$; when it is clear from the context we will omit the
subscript $S$, which indicates the feature set on which the model was learned. Additionally we will
make use of the concept of predictive agreement of $m$ models $w_1, \dots, w_m$, which we define as

$$PA(w_1, \dots, w_m, x) = \frac{\sum_{i \neq j} P(w_i(x) = w_j(x))}{m(m-1)},$$

where $P(w_i(x) = w_j(x))$ is the probability that, given some instance $x$, the two models $w_i$
and $w_j$ produce the same prediction. Empirically we evaluate it on a dataset $X_t$ of $l$
instances by

$$P(w_i, w_j, X_t) = \frac{\sum_{k=1}^{l} \delta(w_i(x_k), w_j(x_k))}{l},$$

where $\delta(a, b) = 1$ if $a = b$, and 0 otherwise. We will denote by $D(A_i, A_j) \in [0, 1]$
the dissimilarity of two feature sets $A_i$ and $A_j$, defined as one minus their normalized
overlap, and by

$$Dis(A_1, \dots, A_m) = \frac{\sum_{i \neq j} D(A_i, A_j)}{m(m-1)}$$

the dissimilarity for $m$ feature sets $A_1, \dots, A_m$. Finally we will call these $m$ feature
sets non-trivially dissimilar if for any pair of feature sets $1 - D(A_i, A_j) < \theta$, where
$\theta$ is a small positive value, e.g. 0.6.
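As an illustration of the empirical quantities just defined, the following minimal sketch computes the average pairwise predictive agreement and the average pairwise dissimilarity of feature sets. All function names are hypothetical. The set dissimilarity is taken here to be the Jaccard distance, which is an assumption of this sketch and only one possible choice of normalized overlap.

```python
from itertools import combinations


def pairwise_agreement(preds_i, preds_j):
    """Fraction of instances on which two models predict the same label."""
    assert len(preds_i) == len(preds_j)
    same = sum(1 for a, b in zip(preds_i, preds_j) if a == b)
    return same / len(preds_i)


def predictive_agreement(all_preds):
    """Average pairwise agreement over m >= 2 prediction vectors."""
    pairs = list(combinations(all_preds, 2))
    return sum(pairwise_agreement(p, q) for p, q in pairs) / len(pairs)


def set_dissimilarity(a, b):
    """ASSUMPTION: Jaccard distance, 1 - |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def average_dissimilarity(feature_sets):
    """Average pairwise dissimilarity over m >= 2 feature sets."""
    pairs = list(combinations(feature_sets, 2))
    return sum(set_dissimilarity(a, b) for a, b in pairs) / len(pairs)


# Toy predictions of three models on four instances (made-up values).
preds = [[1, 1, -1, -1], [1, 1, -1, 1], [1, -1, -1, -1]]
print(predictive_agreement(preds))

# Toy feature subsets that share feature 0 (made-up values).
subsets = [{0, 1, 2}, {0, 3}, {0, 4}]
print(average_dissimilarity(subsets))
```

Averaging over unordered pairs gives the same value as averaging over all ordered pairs with $i \neq j$, since both quantities are symmetric.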

In the following section we will provide the definitions of the main concepts that we will be using 
in the problem of uncovering the structure of a set of predictive features. 

2.1 Key Definitions 

We will start by giving the definition of equally good and dissimilar models, EGDM, with the help of 
predictive error and predictive model agreement. We will subsequently use the definition of EGDM 
to define the concept of non-replaceable feature set, NR, followed by the concept of functionally
similar set of feature sets conditioned on some feature set, FSSFS. The NR set and the FSSFS
correspond to the structure of the feature set that we wish to uncover.

Definition 1 Given $S$ a set of relevant features, a set of $m$ models $w_{A_1}, \dots, w_{A_m}$,
learned over non-trivially dissimilar feature subsets $A_i$ of $S$, $A_i \subset S$, we will call
the set of $m$ models a set of Equally Good and Dissimilar Models, EGDM, if for every model we have
$|e(w_{A_i}) - e(w_S)| < \epsilon_1$, and for all models we have
$1 - PA(w_{A_1}, \dots, w_{A_m}, x) < \epsilon_2$, where $\epsilon_1$ and $\epsilon_2$ are some
small positive values.

In other words a set of m models is an EGDM if the different models that belong to it are defined 
over non-trivially dissimilar feature subsets, have a predictive error that is almost identical to that of 
the model learned on the full set of relevant features S, and a very high predictive agreement. From 
the definition it is clear that the EGDM models are models of high predictive performance, almost 
the same as that of the full feature set, which use different discriminative features. Note that
not all learning problems can have EGDM models, but only those which contain a high level of
structurally redundant information with respect to the target variable.

Definition 2 A feature set $I \subset S$ is non-replaceable, NR, if $I = \bigcap_i A_i$, i.e. if it
is the intersection of the feature sets used in the EGDM models.

The features of NR are not necessarily the most predictive ones. However when we want to max- 
imize the predictive performance their information contribution cannot be brought in by any other 
subset of S. Note that in order to establish the NR of a feature set S we need to discover all EGDM 
models. This is an NP-hard problem. We will approximately learn NR by learning as many as 
possible EGDM models. 

Definition 3 We will call a set of $m$ different feature sets $F_1, \dots, F_m$,
$\bigcap_{i=1}^{m} F_i = \emptyset$, a functionally similar set of feature sets conditioned on a
given feature set $C$, and denote it by FSSFS, if the set of models
$w_{F_1 \cup C}, \dots, w_{F_m \cup C}$ is an EGDM set.

The information content that these $m$ feature sets deliver with respect to the target variable is
equivalent. When learning in the presence of the feature set $C$, any feature set of the FSSFS can
be replaced by some other feature set of the FSSFS with almost no change in predictive behavior.
The FSSFS provides us with a structure of the feature set in terms of sets of feature sets of
similar information content with respect to the target variable. This, together with the NR set,
fully structures the feature set $S$ in terms of non-replaceable features as well as sets of
features that bring the same information content. The availability of such a structure can provide
us with a much more complete picture than a flat list of relevant features can. This is especially
crucial in data analysis scenarios in which gaining an insight into, and understanding, the
mechanisms that underlie and produced the learning dataset is what drives data analysis.
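Given a set of learned linear models, Definitions 2 and 3 suggest a direct way of reading off the two feature sets. The sketch below is only an illustration under stated assumptions: a feature counts as used by a model when its coefficient exceeds a small tolerance `tol` in absolute value (the tolerance and all names are hypothetical), the NR estimate is the intersection of the supports, and the FSSFS estimate collects what remains of each support.

```python
def support(w, tol=1e-8):
    """Indices of the features a linear model actually uses."""
    return {l for l, v in enumerate(w) if abs(v) > tol}


def structure_features(models, tol=1e-8):
    """Return (NR, FSSFS) estimated from a list of weight vectors.

    NR is the intersection of all supports; FSSFS holds, for each model,
    the features of its support that are not in NR.
    """
    supports = [support(w, tol) for w in models]
    nr = set.intersection(*supports)
    fssfs = [s - nr for s in supports]
    return nr, fssfs


# Three toy models over five features (made-up coefficients).
models = [
    [0.9, 0.4, 0.0, 0.0, 0.0],
    [0.8, 0.0, 0.5, 0.0, 0.0],
    [1.1, 0.0, 0.0, 0.3, 0.0],
]
nr, fssfs = structure_features(models)
print(nr)
print(fssfs)
```

In this toy case feature 0 appears in every model and forms the NR estimate, while features 1, 2 and 3 form three interchangeable complements. Feature 4 is used by no model and therefore belongs to neither set.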

In the following sections we will present an algorithm which learns a set of equally good and
dissimilar models satisfying Definition 1. To the best of our knowledge it is the first algorithm
of that kind.



2.2 Multiple EGDM Learning 

At the core of our approach we have the learning of multiple equally good and distinct learning
models. Since we will be given a set of relevant features over which we will learn our multiple
models, here we do not need to use any sparsity constraint on the features, because we know a
priori that all of them are relevant and we want to have models of predictive performance that is
similar to what we can achieve if we use the full feature set. We can make use of standard cost
functions found in non-sparse linear classification algorithms such as linear SVMs and logistic
regression with an $\ell_2$ penalty. However, we will not learn these models independently of one
another. We will learn them
collectively and force them to be dissimilar by using different discriminative features. One cannot 
help but think a symmetry with multi-task learning 0. In the latter we learn similar models over 
different datasets, here we learn dissimilar models over the same dataset. 

We will control the model dissimilarity through the introduction of a disjointness regularization
term. More precisely, we define the disjointness regularization term for $m$ models,
$w_1, \dots, w_m$, as:

$$\Omega(w_1, \dots, w_m) = \sum_{i < j} |w_i|^T |w_j| = \sum_{i < j} \sum_{l=1}^{d} |w_{il} w_{jl}| \qquad (1)$$



This regularization is motivated by a simple observation. If two models are totally disjoint, their
entry-wise product will be the zero vector. Based on this observation we regularize the sum of the
sparsity-inducing $\ell_1$ norm on the element-wise product of all model pairs. Minimizing this sum
will push most of the entries of the element-wise model products to be 0; as a result different
models will select different features. Note that there is no disjointness regularization on the
intercepts of the models.
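The observation behind the disjointness term can be checked on a small example. The sketch below sums $|w_i|^T |w_j|$ over unordered model pairs (the pair convention and the function name are assumptions of this sketch) and shows that the term vanishes exactly when no feature is shared.

```python
from itertools import combinations


def disjointness(models):
    """Sum over unordered model pairs of |w_i|^T |w_j| (intercepts excluded)."""
    total = 0.0
    for wi, wj in combinations(models, 2):
        total += sum(abs(a) * abs(b) for a, b in zip(wi, wj))
    return total


# Two models with disjoint supports, and two that share feature 0.
disjoint = [[1.0, -2.0, 0.0, 0.0], [0.0, 0.0, 0.5, 3.0]]
overlapping = [[1.0, -2.0, 0.0, 0.0], [0.5, 0.0, 0.5, 3.0]]

print(disjointness(disjoint))     # 0.0
print(disjointness(overlapping))  # 0.5
```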

To learn the different models we will use the linear SVM objective function, i.e. we will minimize
the trade-off between the margin, i.e. the $\ell_2$ norm of the normal vectors, and the hinge loss
error. In addition we want the different models to produce similar predictions; to do so we will
add an additional term which will penalize the prediction disagreement. We will only constrain the
models to predict the same class label; an even stronger constraint would force them to produce
exactly the same output value. The final optimization problem will be created from the combination
of the above elements, i.e. the SVM objective function, the prediction agreement constraints, and
the disjointness regularization term, and will be:



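$$\min_{w, b, \epsilon, \xi} \; \sum_{i=1}^{m} \sum_{k=1}^{n} \epsilon_{ik} + \lambda_1 \sum_{i < j} \sum_{k=1}^{n} \xi_{ijk} + \lambda_2 \sum_{i < j} |w_i|^T |w_j| + \lambda_3 \sum_{i=1}^{m} \|w_i\|_2^2 \qquad (2)$$

$$\text{s.t.} \quad y_k (b_i + w_i^T x_k) \geq 1 - \epsilon_{ik}, \quad \epsilon_{ik} \geq 0, \quad \forall i, k$$

$$(b_i + w_i^T x_k)(b_j + w_j^T x_k) \geq -\xi_{ijk}, \quad \xi_{ijk} \geq 0, \quad \forall i, j, k$$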



This optimization problem learns $m$ dissimilar models. We maximize prediction agreement by
penalizing the prediction disagreement, defined in the second constraint, and we control its
importance through the $\lambda_1$ parameter. We control the importance of the model dissimilarity
through the trade-off defined by the $\lambda_2$ and $\lambda_3$ parameters over the model
dissimilarity and the $\ell_2$ model norms. We should emphasize that the use of the $\ell_2$ model
norms has an additional advantage. In combination with the model dissimilarity term, as in elastic
net regularization [11], it forces highly correlated features to be either included all together in
the same model or excluded all together. As a result we do not discover dissimilar models that
arise simply because highly correlated features are placed in different models, which would produce
models with very similar predictive performance and at the same time high model dissimilarity. It
is this property that guarantees the discovery of valuable feature groupings. We will call the
optimization problem given in (2) Multiple Dissimilar SVMs, MD-SVMs.
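To make the role of the three trade-off parameters concrete, the following sketch evaluates the MD-SVMs objective for a given set of models. It is an illustration, not the authors' implementation: the slack variables are eliminated by using their optimal values, i.e. the hinge loss max(0, 1 - y f(x)) for classification and max(0, -f_i(x) f_j(x)) for disagreement, pairs are counted once, and all names are hypothetical.

```python
from itertools import combinations


def decision(w, b, x):
    """Linear decision value b + w^T x."""
    return b + sum(wl * xl for wl, xl in zip(w, x))


def md_svm_objective(models, X, y, lam1, lam2, lam3):
    """Objective value for models given as a list of (w, b) pairs."""
    # Classification hinge loss of every model on every instance.
    hinge = sum(max(0.0, 1.0 - yk * decision(w, b, xk))
                for w, b in models for xk, yk in zip(X, y))
    # Disagreement penalty: positive only when two outputs differ in sign.
    disagree = sum(max(0.0, -decision(wi, bi, xk) * decision(wj, bj, xk))
                   for (wi, bi), (wj, bj) in combinations(models, 2)
                   for xk in X)
    # Disjointness term |w_i|^T |w_j| over model pairs.
    disjoint = sum(sum(abs(a) * abs(c) for a, c in zip(wi, wj))
                   for (wi, _), (wj, _) in combinations(models, 2))
    # Squared l2 norms of all models.
    l2 = sum(sum(v * v for v in w) for w, _ in models)
    return hinge + lam1 * disagree + lam2 * disjoint + lam3 * l2


# Toy data with two redundant features (made-up values).
X = [[1.0, 1.0], [-1.0, -1.0]]
y = [1, -1]
disjoint_models = [([1.0, 0.0], 0.0), ([0.0, 1.0], 0.0)]
identical_models = [([1.0, 0.0], 0.0), ([1.0, 0.0], 0.0)]

obj_disjoint = md_svm_objective(disjoint_models, X, y, 1.0, 0.5, 0.1)
obj_identical = md_svm_objective(identical_models, X, y, 1.0, 0.5, 0.1)
print(obj_disjoint, obj_identical)
```

Both pairs of models classify the toy data with zero hinge loss and agree on every instance, so the two objective values differ only through the disjointness term, which penalizes the pair that reuses feature 0.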

Since all models learned by MD-SVMs should achieve the same target, i.e. minimize the two hinge
loss errors and the $\ell_2$ regularization, the result is that they compete against each other to
select the most useful features. Features that are most important in predicting the target variable
and cannot be replaced by other features will be present in all models, giving rise to the NR
feature set. Feature sets that are highly complementary to the NR and structurally redundant
between them will end up in different models, producing the different feature sets of the FSSFS
conditioned on the NR set.

Before describing in the next section how we will solve the optimization problem of MD-SVMs we will
briefly discuss some related work on dissimilar model learning. In fact there has been very limited
work on learning dissimilar models, and this in a quite different context. [16, 9] learn dissimilar
models for the tree-structured multi-class classification problem. The different models are learned
over different sub-classification tasks of some given multi-class classification problem. Their
motivation was that dissimilar classes may have different discriminative features. The authors of
[16] proposed the following orthogonal regularization:

$$\Omega(w_1, \dots, w_m) = \sum_{ij} K_{ij} |w_i^T w_j| \qquad (3)$$

where $K_{ij}$ is the weight for the $i$th and $j$th models. Minimizing this quantity will make the
learned models orthogonal to each other. However, orthogonality does not necessarily imply that
different models will select different features; thus the models learned with such a constraint are
not necessarily disjoint. For instance, the vector [0.5, 0.5] is orthogonal to the vector
[-0.5, 0.5]; however, it is obvious that the two vectors are not disjoint. [9] propose the
following competition regularization term:

$$\Omega(w_1, \dots, w_m) = \sum_{ij} \big\| |w_i| + |w_j| \big\|_2^2 \qquad (4)$$

$$= \sum_{ij} \left( \|w_i\|_2^2 + \|w_j\|_2^2 + 2 |w_i|^T |w_j| \right)$$

This is very similar to the terms that control the model disjointness in problem (2). The main
difference is that here the importance of the $\ell_2$ norm is fixed with respect to that of
$|w_i|^T |w_j|$. In problem (2) we control the trade-off through the different $\lambda$
parameters, a control which is crucial because we need to tune the appropriate model dissimilarity
level.
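The difference between orthogonality and disjointness can be verified numerically on the two vectors used in the example above. The sketch below compares the orthogonality penalty $|w_i^T w_j|$ with the disjointness penalty $|w_i|^T |w_j|$; the function names are hypothetical.

```python
def orthogonality_penalty(wi, wj):
    """|w_i^T w_j|: zero whenever the two vectors are orthogonal."""
    return abs(sum(a * b for a, b in zip(wi, wj)))


def disjointness_penalty(wi, wj):
    """|w_i|^T |w_j|: zero only when no feature is used by both vectors."""
    return sum(abs(a) * abs(b) for a, b in zip(wi, wj))


u, v = [0.5, 0.5], [-0.5, 0.5]
print(orthogonality_penalty(u, v))  # 0.0
print(disjointness_penalty(u, v))   # 0.5
```

The two vectors are orthogonal, so the first penalty is zero, yet both use both features, which only the second penalty detects.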

We can easily extend the MD-SVMs learning problem to multi-class classification problems. Given
$m \times c$ models for a $c$-class classification problem, $w_{11}, \dots, w_{mc}$, the
disjointness regularization term can be defined as:

$$\Omega(w_{11}, \dots, w_{mc}) = \sum_{i < j} \Big( \sum_{l=1}^{c} |w_{il}| \Big)^T \Big( \sum_{l=1}^{c} |w_{jl}| \Big) \qquad (5)$$

Similar to the disjointness regularization in equation (1), we want to force the different models
to use different discriminative feature groups. However, the disjointness term now regularizes
model dissimilarity between the different c model groups that correspond to the c class
classification problem. How prediction agreement is maximized in multiple disjoint model learning
for multi-class classification will depend on the methodology used to deal with the multi-class
learning [8, 3].

2.3 Optimization 

Algorithm 1 MD-SVMs

Input: $X$, $Y$, $\lambda_1$, $\lambda_2$, $\lambda_3$, $m$
Output: $w_i$s and $b_i$s
initialize: $w^0$s = 0, $b^0$s = 0, and $i = 1$
repeat
    for $j = 1, \dots, m$ do
        Learn $(w_j^i, b_j^i)$ by solving the convex problem (6)
    end for
    $i := i + 1$
until convergence

Since the prediction agreement constraint in problem (2) is not convex, the optimization problem
(2) is not convex either. However, if we fix all models except the $i$th one, then the optimization
problem of learning $w_i$, $b_i$ becomes:

$$\min_{b_i, w_i, \epsilon, \xi} \; C + \sum_{k=1}^{n} \epsilon_{ik} + \lambda_1 \sum_{j=1, j \neq i}^{m} \sum_{k=1}^{n} \xi_{ijk} + \lambda_2 \sum_{j \neq i} |w_i|^T |w_j| + \lambda_3 \|w_i\|_2^2 \qquad (6)$$

$$\text{s.t.} \quad y_k (b_i + w_i^T x_k) \geq 1 - \epsilon_{ik}, \quad \epsilon_{ik} \geq 0, \quad \forall k$$

$$(b_i + w_i^T x_k)(b_j + w_j^T x_k) \geq -\xi_{ijk}, \quad \xi_{ijk} \geq 0, \quad \forall j, k$$

where $C$ is a constant, the value of which is the sum of the constant terms that do not depend on
the $i$th model. Fortunately, this is a convex problem that is similar to the optimization problem
of EN-SVM [14]. With some algebra, the objective function of (6) can be rewritten as:

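$$C + \sum_{k=1}^{n} \epsilon_{ik} + \lambda_1 \sum_{j=1, j \neq i}^{m} \sum_{k=1}^{n} \xi_{ijk} + \sum_{l} \Big( \lambda_2 \sum_{j, j \neq i} |w_{jl}| \Big) |w_{il}| + \lambda_3 \|w_i\|_2^2 \qquad (7)$$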
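The algebra behind this rewriting only involves the disjointness term of (6). Writing the inner product of the absolute-value vectors coordinate by coordinate and exchanging the order of summation gives, with $d$ denoting the number of features:

```latex
\begin{align*}
\lambda_2 \sum_{j \neq i} |w_i|^T |w_j|
  &= \lambda_2 \sum_{j \neq i} \sum_{l=1}^{d} |w_{il}| \, |w_{jl}| \\
  &= \sum_{l=1}^{d} \Big( \lambda_2 \sum_{j \neq i} |w_{jl}| \Big) |w_{il}| .
\end{align*}
```

Since the other models are fixed, the term in parentheses is a constant for each feature $l$, so the disjointness term acts on $w_i$ as a weighted $\ell_1$ norm.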

Comparing optimization problem (7) to EN-SVM we see that the former has different $\ell_1$
regularization weights for different features; for the $l$th feature the weight of the $\ell_1$
norm of $w_{il}$ is $\lambda_2 \sum_{j, j \neq i} |w_{jl}|$. From (7) we can see how the
disjointness regularization works. For example, if some models have already selected the $l$th
feature, the $i$th model will have a reduced probability of including the $l$th feature, because
the weight of the $\ell_1$ norm for $w_{il}$ increases. At the same time the probability of
including other useful features which were not selected by the other models increases.
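The feature-specific weights can be computed directly from the fixed models. The following sketch (with a hypothetical function name and made-up coefficients) returns, for the model being updated, the $\ell_1$ weight of each feature.

```python
def l1_weights(models, i, lam2):
    """Per-feature l1 weight lam2 * sum_{j != i} |w_jl| seen by model i."""
    d = len(models[i])
    return [lam2 * sum(abs(w[l]) for j, w in enumerate(models) if j != i)
            for l in range(d)]


# Model 0 is about to be updated; models 1 and 2 are fixed.
models = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [1.0, 0.5, 0.0]]
print(l1_weights(models, 0, lam2=3.0))  # [9.0, 1.5, 0.0]
```

Feature 0 is heavily used by the other two models and thus receives the largest penalty, whereas feature 2 is unused and carries no $\ell_1$ penalty at all.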

Since the optimization problem (6) is convex, we will use the alternating convex optimization
method to iteratively solve (2). The details of the proposed algorithm are described in Algorithm
1. At each step, we learn only one of the $m$ models while the parameters of the rest are fixed.
The convergence (possibly to a local optimum) of the alternating convex optimization method is
guaranteed [1]. The main difficulty in optimizing problem (6) is the non-differentiability of its
objective function due to the $\ell_1$ regularization. As we will demonstrate our approach on
microarray classification problems, which typically have a small number of high-dimensional
instances, we will solve (6) by the alternating direction method of multipliers (ADMM), following
the work of [14]. However, for large scale datasets with thousands of instances and features, a
stochastic learning algorithm that exploits the regularization structure, such as the Regularized
Dual Averaging method [13], could be an alternative approach to optimizing problem (6).
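The alternating scheme can be sketched in a few lines. The code below is a simplified illustration and not the solver used in this paper: the inner convex problem is solved approximately by plain subgradient descent instead of ADMM, the slack variables are replaced by the corresponding hinge terms, and the step size, iteration counts, default parameter values and all function names are assumptions chosen for a toy example.

```python
def decision(w, b, x):
    """Linear decision value b + w^T x."""
    return b + sum(wl * xl for wl, xl in zip(w, x))


def sign(v):
    return (v > 0) - (v < 0)


def fit_one(i, models, X, y, lam1, lam2, lam3, steps=200, lr=0.01):
    """Approximately solve the per-model problem for model i by
    subgradient descent (a stand-in for ADMM), all other models fixed."""
    w, b = list(models[i][0]), models[i][1]
    d = len(w)
    others = [mdl for j, mdl in enumerate(models) if j != i]
    # Feature-specific l1 weights induced by the fixed models.
    l1w = [lam2 * sum(abs(wj[l]) for wj, _ in others) for l in range(d)]
    for _ in range(steps):
        gw = [l1w[l] * sign(w[l]) + 2.0 * lam3 * w[l] for l in range(d)]
        gb = 0.0
        for xk, yk in zip(X, y):
            fi = decision(w, b, xk)
            if yk * fi < 1.0:              # classification hinge is active
                for l in range(d):
                    gw[l] -= yk * xk[l]
                gb -= yk
            for wj, bj in others:
                fj = decision(wj, bj, xk)
                if fi * fj < 0.0:          # disagreement hinge is active
                    for l in range(d):
                        gw[l] -= lam1 * fj * xk[l]
                    gb -= lam1 * fj
        w = [w[l] - lr * gw[l] for l in range(d)]
        b -= lr * gb
    return w, b


def md_svms(X, y, m, lam1=1.0, lam2=0.5, lam3=0.1, rounds=5):
    """Alternating optimization: update one model at a time."""
    d = len(X[0])
    models = [([0.0] * d, 0.0) for _ in range(m)]
    for _ in range(rounds):
        for i in range(m):
            models[i] = fit_one(i, models, X, y, lam1, lam2, lam3)
    return models


def accuracy(model, X, y):
    w, b = model
    hits = sum(1 for xk, yk in zip(X, y) if sign(decision(w, b, xk)) == yk)
    return hits / len(y)


# Toy, linearly separable data with two redundant features (made-up values).
X = [[2.0, 1.5], [1.5, 2.0], [2.5, 2.0],
     [-2.0, -1.5], [-1.5, -2.0], [-2.0, -2.5]]
y = [1, 1, 1, -1, -1, -1]
models = md_svms(X, y, m=2)
print([accuracy(mdl, X, y) for mdl in models])
```

A fixed number of rounds stands in for a proper convergence test. The toy data are symmetric in the two features, so this example only illustrates the update order of the alternating loop and is not meant to show feature disjointness emerging.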

3 Experiments 

In this section we will explore the behavior of our MD-SVMs algorithm on nine high dimensional
biological datasets. The details of the datasets are given in Table 1. All the datasets were
preprocessed by standardizing the input features. The main goal of the experiments is to examine
the degree to which the models produced by the MD-SVMs algorithm are EGDM, i.e. the degree to which
they satisfy the three properties of the EGDMs given in Definition 1, namely the disjointness
property, the high predictive agreement property, and the similar predictive power to a model
learned on the complete set of relevant features S. To acquire the latter we use the EN-SVM
algorithm, taking special care to avoid any information leakage between training and testing, as we
will see later in the full description of the experimental setup.
Table 1: Examined datasets.

Datasets          # Sample   # Feature   # Class
Lung              39         1971        2
Breast 1          60         1368        2
Breast 2          58         3389        2
Breast 3          49         7129        2
Colon             62         2000        2
Male vs. Female   134        1524        2
CNS               60         7129        2
Leukemia          72         7129        2
Ovarian           253        771         2

To evaluate the degree to which the multiple models have a predictive performance which is
comparable to that of the single model learned over the S set we will compare them against a
standard single linear SVM model learned on S. In addition we also want to examine the information
content of the NR and FSSFS feature sets that are established as a result of the application of the
MD-SVMs. To do so we will use the MD-SVMs to establish these feature sets and subsequently train
over them a standard linear SVM.

For the EN-SVM we select the values of the parameters that determine the importance of the
$\ell_1$ and $\ell_2$ norms from the sets {0.1, 1, 10} and {1, 10, 100} respectively, using a
two-fold inner Cross-Validation (CV) on the training set. We set the value of the parameter of the
$\ell_2$ norm of the linear SVM that we use to estimate the predictive power of the different
feature sets to the value of the respective EN-SVM parameter. The MD-SVMs algorithm has three
hyper-parameters, see problem (2). To reduce the computational burden we set the value of
$\lambda_1$ to one, which is large enough to achieve high prediction agreement between models. We
tune the remaining two parameters that control the model disjointness, i.e. $\lambda_2$ and
$\lambda_3$, as well as the number of models $m$, using inner 2-fold CV. A large
$\lambda_2 / \lambda_3$ ratio corresponds to more dissimilar models. We select $\lambda_3$ from
{0.1, 1, 10, 100} and the value of $\lambda_2$ from {3, 5, 7, 10} $\times \lambda_3$. The number of
models $m$ is selected from {1, 2, 3, 4, 5}.
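The MD-SVMs search space described above is small enough to enumerate explicitly. The sketch below builds it with the values stated in the text; the dictionary keys and the function name are hypothetical.

```python
from itertools import product

LAMBDA1 = 1.0                        # fixed to one
LAMBDA3_VALUES = [0.1, 1, 10, 100]
RATIOS = [3, 5, 7, 10]               # lambda2 = ratio * lambda3
M_VALUES = [1, 2, 3, 4, 5]           # number of models


def md_svm_grid():
    """All (lambda1, lambda2, lambda3, m) settings explored by the inner CV."""
    grid = []
    for lam3, ratio, m in product(LAMBDA3_VALUES, RATIOS, M_VALUES):
        grid.append({"lam1": LAMBDA1, "lam2": ratio * lam3,
                     "lam3": lam3, "m": m})
    return grid


grid = md_svm_grid()
print(len(grid))  # 80
```

Tying $\lambda_2$ to $\lambda_3$ through a ratio means the grid explores the relative strength of the disjointness term directly, which is the quantity that governs how dissimilar the models become.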

MD-SVMs needs to be trained on the set of relevant features S, which we establish through EN-SVM.
Given a training and a testing set, tr and ts respectively, of some fold, we select the best
parameter setting for EN-SVM, $\lambda^*_E$, on tr with two-fold inner CV. We then tune the
parameters of MD-SVMs also by two-fold inner CV on tr, where in each fold the relevant feature set
S is given by the application of EN-SVM with the $\lambda^*_E$ parameter setting that was tuned on
the tr* of this fold. Once we select the appropriate setting for the MD-SVMs we reapply it on the
tr set to produce the $m$ models, which we then test on the ts set.

We should note here that the objective function of MD-SVMs uses both the predictive performance and
the model dissimilarity, which are in an antagonistic relation, i.e. higher dissimilarity most
often leads to lower predictive performance. This means that if we use only the classification
error to guide the parameter selection for MD-SVMs in the inner CV, most often we will arrive at
configurations that have the smallest model disjointness. However, we would still like to have a
certain tolerance for model dissimilarity, since we want to get diverse models. In order to achieve
that we use a trade-off between the classification error and the model dissimilarity to select the
best parameter setting. More precisely, the evaluation quantity that will drive the parameter
selection for MD-SVMs is now:

$$z = (1 + Dis \cdot \sigma\%) \cdot P \qquad (8)$$

where $P$ is the average accuracy of the learned models, and $Dis$ is the average pairwise model
dissimilarity estimated by the inner 2-fold CV. We select the parameter setting that maximizes $z$.
The parameter $\sigma$ controls the trade-off between accuracy and dissimilarity, with larger
values of $\sigma$ favoring more dissimilar models. As we are learning EGDMs, here we set
$\sigma = 2$ to make their predictive performance similar to that of the model learned on the S
feature set.
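The effect of this selection criterion can be illustrated on a few candidate settings. In the sketch below the percentage in (8) is read as $\sigma / 100$, which is an assumption about the notation; the candidate accuracies and dissimilarities are made-up numbers and the names are hypothetical.

```python
def selection_score(accuracy, dissimilarity, sigma=2.0):
    """z = (1 + Dis * sigma%) * P, reading sigma% as sigma / 100."""
    return (1.0 + dissimilarity * sigma / 100.0) * accuracy


# Hypothetical inner-CV results for three parameter settings.
candidates = [
    {"name": "A", "accuracy": 0.90, "dissimilarity": 0.10},
    {"name": "B", "accuracy": 0.89, "dissimilarity": 0.80},
    {"name": "C", "accuracy": 0.80, "dissimilarity": 0.95},
]
best = max(candidates,
           key=lambda c: selection_score(c["accuracy"], c["dissimilarity"]))
print(best["name"])  # B
```

With $\sigma = 2$ the dissimilarity can raise the score by at most two percent, so setting B, which is slightly less accurate than A but much more dissimilar, is preferred, while the clearly less accurate setting C is not.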

The $w_i$s and $b_i$s parameters of the $m$ models that will be learned by MD-SVMs are initialized to
—w and —b, where w and b are ones learned by EN-SVM. Since with MD-SVMs we learn m 
models, we use as its predictive performance the average predictive performance of the m models. 
To estimate the predictive performance for each dataset we generated 10 random splits to training 
and testing. In each split, 80% of the instances were used for training and the rest for testing. 
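The evaluation protocol of repeated random splits can be sketched as follows. The rounding of the training-set size, the fixed seed and the absence of class stratification are assumptions of this sketch, since the text does not specify them; the function name is hypothetical.

```python
import random


def random_splits(n, n_splits=10, train_fraction=0.8, seed=0):
    """Return n_splits random (train_indices, test_indices) partitions."""
    rng = random.Random(seed)
    n_train = int(round(train_fraction * n))
    splits = []
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        splits.append((sorted(idx[:n_train]), sorted(idx[n_train:])))
    return splits


splits = random_splits(62)   # e.g. a dataset with 62 samples, such as Colon
train, test = splits[0]
print(len(splits), len(train), len(test))  # 10 50 12
```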

We report the accuracy results in Table 2. The models of MD-SVMs have the same predictive
performance as the single SVM learned on the S feature set in six of the nine datasets, while for
the remaining three, Breast 1, Male vs. Female, and CNS, their performance is very similar. For the
single SVM models learned on the NR and FSSFS feature sets, we see that their predictive
performance is always worse than that of MD-SVMs. This indicates that these feature sets contain
predictive information that is complementary and should be used in the same model, as is done by
MD-SVMs, and not independently. In terms of the prediction agreement of the models learned by
MD-SVMs, we see that this is quite high, at least 84%, with the exception of the Breast 3 dataset,
for which it is around 78% and for which the predictive performance of all the methods was close to
that of the default classifier. In terms of the number of features that the models of MD-SVMs use,
these range over the different datasets from roughly 50% to 76% of the features of the S feature
set (Table 3). The number of core features, i.e. the cardinality of the NR set, ranges over the
different datasets from 11% to almost 50% of the features of the S feature set. The average size of
the FSSFS feature sets ranges from 25% to 39% of the features of S.

Table 2: Accuracy results and statistics on the properties of the MD-SVMs models.

Datasets          SVM (S)       SVM (FSSFS)   SVM (NR)      MD-SVMs Accuracy   # Models    Agreement   Dis. Score
Lung              77.14±15.36   68.62±12.54   60.00±34.21   77.40±12.58        3.50±1.27   0.92±0.10   0.65±0.32
Breast 1          75.83±9.17    71.94±10.10   65.83±13.86   73.53±8.78         3.00±1.15   0.91±0.08   0.54±0.34
Breast 2          89.09±5.75    85.41±5.61    82.73±10.88   88.82±6.53         3.50±1.35   0.97±0.03   0.41±0.26
Breast 3          54.44±13.30   55.39±10.20   28.89±33.62   58.24±11.71        3.90±0.99   0.78±0.17   0.66±0.45
Colon             85.00±10.24   83.56±9.79    75.00±28.60   85.56±7.54         3.10±1.29   0.95±0.05   0.69±0.32
Male vs. Female   90.00±3.72    83.31±4.86    63.08±19.38   86.11±5.89         3.00±0.94   0.86±0.09   0.78±0.20
CNS               75.00±13.61   71.39±10.23   46.67±32.68   72.08±9.57         2.90±0.99   0.84±0.11   0.71±0.34
Leukemia          97.14±3.69    96.51±4.20    55.00±47.62   97.29±3.26         4.00±0.82   0.99±0.01   0.69±0.32
Ovarian           98.20±1.99    96.44±2.21    96.60±2.99    98.15±1.49         2.60±0.97   0.99±0.01   0.43±0.17

Table 3: Relative cardinalities of the features found/used in the FSSFS, NR, and MD-SVMs, with
respect to the cardinality of the S feature set.

Datasets          FSSFS           NR              MD-SVMs
Lung              31.76±14.64 %   22.48±31.59 %   54.24±22.03 %
Breast 1          30.64±16.28 %   39.05±34.06 %   69.69±18.90 %
Breast 2          36.61±14.98 %   39.96±26.86 %   76.57±13.94 %
Breast 3          25.11±17.72 %   29.21±45.89 %   54.32±31.88 %
Colon             36.04±15.91 %   22.02±33.75 %   58.06±18.57 %
Male vs. Female   39.42±8.65 %    11.56±18.81 %   50.97±14.55 %
CNS               36.46±14.36 %   19.28±31.73 %   55.74±22.80 %
Leukemia          31.13±12.06 %   18.07±31.00 %   49.20±24.37 %
Ovarian           26.01±9.86 %    49.85±19.73 %   75.86±11.05 %



4 Conclusion 

Motivated by the limitation that current feature selection algorithms only provide a flat list of
relevant features with no further information on the internal structure of that feature set, we
propose a new learning paradigm in which we try to uncover the structure that underlies the set of
relevant features of some learning problem. We do so by learning over this relevant feature set as
many as possible equally good and dissimilar models, i.e. models that have a very high predictive
power, high predictive agreement, and are defined over different subsets of the set of relevant
features. These models structure the set of relevant features into a set of non-replaceable
features, i.e. features that are always present over all the models, and a set of functionally
similar feature sets which can be used interchangeably with no loss of predictive performance given
the set of non-replaceable features. This type of feature structure can be extremely valuable for
many application domains in which what drives the analysis process is understanding the mechanisms
that underlie the learning dataset, a scenario that is typical in scientific knowledge discovery
problems. In order to achieve this kind of feature structure we presented a novel multiple model
learning algorithm which, among other things, controls the model dissimilarity, in terms of the
features that these models use, as well as the predictive model agreement. We demonstrate its
ability to learn equally good and dissimilar models on a number of high dimensional microarray
classification problems.

There is considerable work that needs to be done in order to fully explore and exploit the possibilities 
that the new learning paradigm that we propose opens as well as to understand better its behavior. 
We want to extend it to non-linear models, e.g. kernel methods, in order to discover non-linear 
distinct feature groups. We also want to constrain model disjointness in a more meaningful manner,
typically using background knowledge on feature dependencies and interactions.